NSF PAR Search | NSF Public Access Repository

Resource-efficient Inference with Foundation Model Programs

Nie, Lunyiu; Ding, Zhimin; Yu, Kevin; Cheung, Marco; Jermaine, Christopher; Chaudhuri, Swarat (October 2025, Conference on Language Models (COLM) 2025)

Free, publicly-accessible full text available October 7, 2026
Prompt Tuning Strikes Back: Customizing Foundation Models with Low-Rank Prompt Adaptation

Jain, Abhinav; Chaudhuri, Swarat; Reps, Thomas; Jermaine, Christopher (December 2024, Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024)

Full Text Available
Prompt Tuning Strikes Back: Customizing Foundation Models with Low-Rank Prompt Adaptation

Jain, Abhinav; Chaudhuri, Swarat; Reps, Thomas W; Jermaine, Christopher M (December 2024, http://papers.nips.cc/paper_files/paper/2024/hash/548551c07a68c8f0a87d67c6167cedb1-Abstract-Conference.html)

Parameter-Efficient Fine-Tuning (PEFT) has become the standard for customizing Foundation Models (FMs) to user-specific downstream tasks. However, typical PEFT methods require storing multiple task-specific adapters, creating scalability issues as these adapters must be housed and run at the FM server. Traditional prompt tuning offers a potential solution by customizing FMs through task-specific input prefixes, but it underperforms compared to other PEFT methods like LoRA. To address this gap, we propose Low-Rank Prompt Adaptation (LoPA), a prompt-tuning-based approach that performs on par with state-of-the-art PEFT methods and full fine-tuning while being more parameter-efficient and not requiring a server-based adapter. LoPA generates soft prompts by balancing between sharing task-specific information across instances and customization for each instance. It uses a low-rank decomposition of the soft-prompt component encoded for each instance to achieve parameter efficiency. We provide a comprehensive evaluation on multiple natural language understanding and code generation and understanding tasks across a wide range of foundation models with varying sizes.
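As a rough illustration of the low-rank idea described in this abstract, the sketch below builds a soft prompt from a shared task-level component and an instance-level component stored as two small factors. The function names, the toy dimensions, and the sigmoid gating used to combine the two components are assumptions made for illustration; the abstract does not specify how the components are combined.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def low_rank_soft_prompt(shared, u, v):
    """Combine a shared prompt (m x d) with instance factors u (m x r) and v (r x d).

    The gating by sigmoid is an illustrative assumption, not taken from the abstract.
    """
    instance = matmul(u, v)  # m x d matrix of rank at most r
    return [
        [s * sigmoid(z) for s, z in zip(shared_row, inst_row)]
        for shared_row, inst_row in zip(shared, instance)
    ]

# Hypothetical toy sizes: m prompt tokens, embedding width d, rank r.
m, d, r = 4, 8, 2
rng = random.Random(0)
shared = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]
u = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(m)]
v = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(r)]

prompt = low_rank_soft_prompt(shared, u, v)

# Parameters needed for the instance-level component.
full_params = m * d            # dense m x d matrix
low_rank_params = r * (m + d)  # two factors, m x r and r x d
```

With these toy sizes the factored form needs r * (m + d) = 24 values instead of m * d = 32; the saving grows as m and d become large relative to r.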
Full Text Available
Online Cascade Learning for Efficient Inference over Streams

Nie, Lunyiu; Ding, Zhimin; Hu, Erdong; Jermaine, Christopher; Chaudhuri, Swarat (July 2024, Forty-first International Conference on Machine Learning, ICML 2024)

Full Text Available
Online Cascade Learning for Efficient Inference over Streams

Nie, Lunyiu; Ding, Zhimin; Hu, Erdong; Jermaine, Christopher; Chaudhuri, Swarat (July 2024, International Conference on Machine Learning (ICML))

Large Language Models (LLMs) have a natural role in answering complex queries about data streams, but the high computational cost of LLM inference makes them infeasible in many such tasks. We propose online cascade learning as an approach to address this challenge. The objective here is to learn a “cascade” of models, starting with lower-capacity models (such as logistic regression) and ending with a powerful LLM, along with a deferral policy that determines the model to be used on a given input. We formulate the task of learning cascades online as an imitation-learning problem, where smaller models are updated over time imitating the LLM expert demonstrations, and give a no-regret algorithm for the problem. Experimental results across four benchmarks show that our method parallels LLMs in accuracy while cutting down inference costs by as much as 90% with strong robustness against input distribution shifts, underscoring its efficacy and adaptability in stream processing.
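The cascade structure described in this abstract can be sketched as follows: cheap models are tried in order, and an input is deferred to the next level when the current model is not confident enough. This is a minimal sketch under stated assumptions. The fixed confidence thresholds stand in for the learned deferral policy, the online imitation-learning updates are omitted, and `keyword_model` and `stub_expert` are hypothetical placeholders rather than anything from the paper.

```python
from typing import Callable, List, Tuple

# A small model returns (label, confidence); the expert returns only a label.
SmallModel = Callable[[str], Tuple[str, float]]
Expert = Callable[[str], str]

def cascade_predict(
    text: str,
    models: List[SmallModel],
    thresholds: List[float],
    expert: Expert,
) -> Tuple[str, int]:
    """Return (label, level), where level is the index of the model that answered."""
    for level, (model, tau) in enumerate(zip(models, thresholds)):
        label, confidence = model(text)
        if confidence >= tau:
            return label, level      # confident enough: stop here
    return expert(text), len(models)  # every small model deferred

def keyword_model(text: str) -> Tuple[str, float]:
    """Hypothetical low-capacity model: confident only on obvious keywords."""
    if "great" in text:
        return "positive", 0.9
    if "awful" in text:
        return "negative", 0.9
    return "positive", 0.5

def stub_expert(text: str) -> str:
    """Hypothetical stand-in for an expensive LLM call."""
    return "negative" if "not" in text else "positive"

easy = cascade_predict("great film", [keyword_model], [0.8], stub_expert)
hard = cascade_predict("not my favourite", [keyword_model], [0.8], stub_expert)
```

Here `easy` is answered at level 0 by the cheap model, while `hard` falls through to the expert at level 1. In the online setting the abstract describes, the expert's answers on deferred inputs would also serve as demonstrations for updating the smaller models.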
Full Text Available
